Interim Report

ONR Grant N00014-93-1-0855
Neural Networks Modeling for Yield Enhancement in Semiconductor Manufacturing

This Interim Report covers the period
July 1, 1993 - December 31, 1993


Introduction 

The major tasks planned for the first period of the research project were the analysis of
the data and the evaluation of its statistical properties. Feasible neural network
architectures have been selected and their feasibility study has been pursued. Data
manipulation software has also been developed during the initial phase of the project for the
purpose of properly interfacing the data with the neural network models.

1.  Data  Release 

The GaAs fabrication data measured under the Microwave/Millimeter Wave Integrated
Circuit (MIMIC) Phase 1, Task 4.E Program were made available to the project team in
the initial phase of the project. This was possible as a result of the data release (Case
Number ASC931796).

The available measurement data were first inspected visually. Some values were found to
lie far outside the range of the measuring devices. Such values were not regarded as
credible measurement results and have been treated as missing (not taken). This step was
necessary to distinguish missing data from data which are still significant but lie far
from their mean values.
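
The screening step described above can be sketched as follows. This is only an illustration: the instrument range, the sample readings, and the function names are hypothetical and are not taken from the project software, which the report says is written in C.

```python
import math

def mark_missing(values, lo, hi):
    """Replace readings outside the instrument range [lo, hi] with NaN (missing)."""
    return [v if lo <= v <= hi else math.nan for v in values]

def mean_ignoring_missing(values):
    """Mean of the readings that were kept; NaN if nothing was kept."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan

# Hypothetical readings; 9999.0 and -9999.0 stand for values far beyond the range.
readings = [1.02, 0.98, 9999.0, 1.35, -9999.0]
cleaned = mark_missing(readings, lo=0.0, hi=10.0)
```

A reading such as 1.35 is kept even though it is distant from the mean, because it is still inside the instrument range; only physically impossible values are marked missing.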

2.  Data  Manipulation  and  Evaluation  Software 

The available test structures were measured across the entire wafer in one test sweep.
Computer software was first written to manipulate the raw measurement data. The
reticle ID, identified as XXYY, the structure ID within the reticle, identified as xxyy, and
the measured value are placed in a file. The characters in the filename have been chosen in
such a manner that the fabricator, process lot, boule source, wafer number, the
characteristic measured, and the processing stage at which the measurement was performed
are easily identifiable.

The majority of the characteristics have been measured on the 1x200 µm MESFET. This
test structure/device (referred to as a device from this point on) is at the center of this
modeling effort. The device allows for a comprehensive DC and RF characterization of the FET
equivalent circuit parameters. For network training purposes we have found it initially
desirable to maintain a density of six training vectors per reticle. Therefore, the data have
been taken at six distinct locations within the reticle. Training vectors have been formed by
assigning each of the non-MESFET characteristics to the nearest-neighbor MESFET. This has
resulted in six complete training vectors per reticle.
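
A minimal sketch of the nearest-neighbor assignment is given below. The device names, coordinates, and characteristic labels are hypothetical, and Euclidean distance within the reticle is an assumption; the report does not state which distance measure is used.

```python
import math

def nearest_mesfet(point, mesfets):
    """Return the name of the MESFET whose position is closest to `point`."""
    return min(mesfets, key=lambda name: math.dist(point, mesfets[name]))

def build_vectors(mesfets, mesfet_values, other_measurements):
    """Attach every non-MESFET measurement to its nearest MESFET.

    mesfets:            {name: (x, y)} positions within the reticle
    mesfet_values:      {name: {label: value}} characteristics measured on the MESFET
    other_measurements: iterable of (label, (x, y), value) from other structures
    If two measurements with the same label land on one MESFET, the later one wins.
    """
    vectors = {name: dict(values) for name, values in mesfet_values.items()}
    for label, position, value in other_measurements:
        vectors[nearest_mesfet(position, mesfets)][label] = value
    return vectors

# Hypothetical two-device example.
mesfets = {"m1": (0.0, 0.0), "m2": (10.0, 0.0)}
mesfet_values = {"m1": {"Idss": 50.0}, "m2": {"Idss": 52.0}}
others = [("Rch", (1.0, 1.0), 3.1), ("Rch", (9.0, -1.0), 3.4)]
vectors = build_vectors(mesfets, mesfet_values, others)
```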

Training vectors subsequently need to be formed for each of the neural network models.
This requires that files of training vectors be created, one representing each processing
stage; for the later stages, a file is also needed which contains all measurements from the
prior fabrication stages.

The manipulation routine initially developed searches through the filenames of the
characteristics measured on a user-selected wafer. The stage at which a measurement was
made is determined by a specific letter in the filename. A large table representing all
measurements on the wafer is formed. Then, for user-requested reticles and process
stages, files of training/test data vectors are created. Another, more flexible routine is
currently in preparation.

A routine which calculates the statistics of the input and output parameters is currently
under development. This routine will provide insight into the form the parameters take, and
will identify outliers and compute statistical moments. The computed statistics will make it
possible to better determine the training conditions given the statistical distribution of the
training and test data.
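
As an illustration of what such a statistics routine might compute, the sketch below evaluates the first four moments of one parameter and flags outliers. The three-standard-deviation rule and the function names are assumptions; the report does not specify how outliers are to be identified.

```python
import math

def moments(xs):
    """Mean, standard deviation, skewness and kurtosis (population forms)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in xs) / (n * std ** 3) if std else 0.0
    kurt = sum((x - mean) ** 4 for x in xs) / (n * var ** 2) if var else 0.0
    return {"mean": mean, "std": std, "skewness": skew, "kurtosis": kurt}

def outliers(xs, z_limit=3.0):
    """Indices of values farther than z_limit standard deviations from the mean."""
    m = moments(xs)
    if m["std"] == 0:
        return []
    return [i for i, x in enumerate(xs) if abs(x - m["mean"]) > z_limit * m["std"]]
```

Values flagged this way are still credible measurements; they differ from the out-of-range readings that are treated as missing.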

3.  Neural  Networks  Modeling  Software 

Inspection of the data indicates that feasible architectures for modeling the process stages
of interest are multilayer feedforward perceptrons (MFFNN). In order to fully control the
training process and to have clearly defined measures of training error, the MFFNN training
software developed in house has been adapted to the project needs. The software allows for
increased input/output capabilities. Each network component, and both the training and test
functions of the MFFNN, can now be controlled and adequately monitored, which has been
rather hard to achieve in commercial training packages. The software is written in C and
can be executed on Unix or PC platforms.

In addition to the adapted training and recall software for the MFFNN, a sensitivity
analysis package has been developed. The objective of this software is to delete inputs which
are irrelevant to the current modeling effort. The detailed properties of the sensitivity
analysis algorithm and its implementation, as well as experimental results, are described in
the paper appended to this report. The sensitivity software has been verified on a number of
cases which show its relevance to our modeling effort.

4. Preliminary Process Modeling Effort

To determine the functionality of each of the programs and the feasibility of the process
stage models, preliminary models of two fabrication stages were created. The post-contact to
post-recess stage was modeled first, followed by the post-contact, post-recess, to post-gate
stage. Initially, data from only one wafer were chosen for modeling. A three-by-three block
of reticles was selected in each of three of the four quadrants of the wafer. This
resulted in 54 (6x9) training/test vectors per quadrant, representing nine reticles per
quadrant. The vectors were shuffled randomly before the training and test files were created.
The first 40 vectors of the shuffled set were used for network training and the last 14 were
left for testing the network. Each group of nine reticles was modeled separately. Sections 4.1
and 4.2 discuss the results of single quadrant modeling.
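
The shuffle-and-split step for one quadrant can be sketched as below. The (reticle, site) tuples stand in for the real training vectors, and the fixed seed is an assumption made only so that the example is repeatable.

```python
import random

def shuffle_and_split(vectors, n_train, seed=None):
    """Shuffle a copy of `vectors`; return (first n_train, remainder)."""
    rng = random.Random(seed)
    shuffled = list(vectors)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Nine reticles with six MESFET sites each give 54 vectors for one quadrant.
vectors = [(reticle, site) for reticle in range(9) for site in range(6)]
train, test = shuffle_and_split(vectors, n_train=40, seed=0)
```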

4.1 Post-contact to Post-recess (c=>r or C) Models

The network for this initial model has six inputs, five outputs, and 12 hidden neurons
plus the bias input. The size of the hidden layer was determined experimentally by varying the
number of hidden neurons and selecting the number which resulted in the lowest training error
over a number of training sessions. Graphs of the measured and modeled values are
presented in Fig. 1. Each of the five diagrams displays the response of the stage model on the
training set and indicates excellent agreement between the measured and MFFNN-modeled data.

Fig. 2 contains graphs of the results for each of the outputs when the network is
presented with the test set. The network now representing the stage model has never seen
the test set data before. The error on these new data is several percent on average.
It can be seen that the network model developed for the C stage provides good
prediction of the process outputs on the test set. Note that the error appears larger in the
figures than it really is, since the axes for the parameter values do not start at zero.
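
The report does not state how the average error is computed. One common reading of "several percent on average" is the mean absolute error relative to the measured value, sketched below under that assumption with hypothetical numbers.

```python
def mean_abs_percent_error(measured, predicted):
    """Average of |predicted - measured| / |measured|, expressed in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two equally long, non-empty sequences")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)

measured = [50.0, 52.0, 48.0, 55.0]    # hypothetical test-set targets
predicted = [51.0, 50.7, 49.2, 53.9]   # hypothetical network outputs
error_pct = mean_abs_percent_error(measured, predicted)
```

Because the error is taken relative to the full measured value, it is unaffected by where a plot axis starts, which is the point made about the figures.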

The preliminary models for the C stage shown in Figs. 1 and 2 produce the following
stage output data: saturation current, peak value of the saturation current, drain-source
resistance, and the slope and intercept of the resistance characteristics. Inputs for this
stage are the initial measurements of the current, two resistances including the drain-source
resistance, the channel resistance, and the slope and intercept of the resistance
characteristics.

4.2 Post-contact, Post-recess, to Post-gate (cr=>g or R)

The network for this stage model has eleven inputs plus the bias input. Six inputs are
post-contact and five are post-recess. The network has 13 outputs and 18 hidden neurons. The
size of the hidden layer was again determined experimentally by varying the number of hidden
neurons and selecting the number which resulted in the lowest training error over a number
of training sessions. Graphs of the measured and modeled values for this stage are
presented in Figs. 3 and 4. The graphs contain information similar to that discussed in
Section 4.1 for the previous process stage.

Twelve of the thirteen modeled outputs are depicted on the graphs. They are the
drain-to-source saturation current, drain-source resistance, gate-source resistance, source
resistance, drain resistance, saturation voltage, drain-to-gate breakdown voltage,
source-to-gate breakdown voltage, pinch-off voltage, mutual transconductance, peak value of
the drain-to-source saturation current, and electrical alignment. The inputs to this stage
model are the following parameters: five are the post-recess outputs listed earlier in
Section 4.1, and six are the post-contact inputs listed as inputs to the C stage models.

While the network-computed outputs on the training set are in very good to excellent
agreement with the measurements, the test set response of the MFFNN yields some differences
between the measured and predicted outputs. It can be seen that the saturation current values
tend to be averaged over the set (see Fig. 4a). Prediction of one of the resistances and of
one further output seems to be less than satisfactory. Prediction of the remaining parameters
is rather accurate. Outliers among the test data, as expected, tend to produce larger errors.

5.  Preliminary  Conclusions  from  Current  Modeling  Efforts 

A number of important issues emerged as the MFFNN training files were created. First
of all, most of the substrate data, or (s) stage data as they are referred to in this work, are
obtained through destructive testing. That is, once the characterization tests are performed,
the wafer is of no further use in circuit fabrication. Therefore, only a few wafers per boule
are characterized this way, and their data are considered representative of the other wafers
from the boule used for circuit fabrication. It was also discovered that these tests were not
performed on every boule, leaving some wafers without substrate characterization. This was
the case for the wafer selected for the preliminary modeling, which had been chosen at
random. Care should be taken in future work to ensure that the wafers used for network
training have proper substrate characterization data.

The substrate measurements were taken on the bare substrate and are reported in
millimeters. Since the measurements are taken prior to the actual fabrication, the data do not
fit the XXYY,xxyy coordinates, which form the structure identification scheme used for all
other measurements. A routine needs to be created to convert these data so that each
measurement is assigned to the MESFET which is closest to the point of measurement.

It has also been discovered that characteristics such as metalization thickness and width
for the different process stages were measured at the final stage. Although the test is
performed at the final stage, these characteristics represent earlier stages and need to be
shifted back to their respective stage models. The reason the measurements were made
at the final stage rather than at the specific process stage is that testing takes a long time
and slows down wafer fabrication. Therefore, since some characteristics are not affected by
subsequent processing stages, their measurement is deferred until the final test.

The data manipulation routines are therefore currently being modified to allow the user
to request specific characteristics rather than those which carry a certain time index of test.
The modification will also let the user request data for specific reticles within a wafer or
for the entire set of wafers. Once finished, complete data sets for modeling each of the stages
for specific reticles or entire wafers will be readily available.

Modeling similar to that of the process stages discussed in this report will be performed
for the remaining stages. Once the data manipulation and statistical routines are completed
and the preliminary modeling results are analyzed, most of the project focus will shift towards
optimizing the network architectures and minimizing the modeling errors for each processing
stage.

6. Appendix: Paper in 1994 Int. IEEE Symposium on Circuits and Systems: "Sensitivity
Analysis for Minimization of Input Data Dimension for Feedforward Neural Network."

Fig. 1. Sensitivity profile during training for the full training set.
Fig. 4. Input significance $\Phi$ evaluated using different overall sensitivities (6)-(8)
and pruning criterion (16).
Fig. 5. Input significance changes during training for the full training set.
Fig. 6. Input significance $\Phi_{avg}$ changes during training for the pruned training set.
Fig. 2. Neural Network Model of the C Stage Tested on the Test Data

Sensitivity Analysis for Minimization of Input Data Dimension for Feedforward Neural Network

Abstract: Multilayer feedforward networks are often used for modeling complex relationships
between data sets. Deleting unimportant data components in the training sets could
lead to smaller networks and reduced-size data vectors. This can be achieved by analyzing
the total disturbance of the network outputs due to perturbed inputs. The search for redundant
data components is performed for networks with continuous outputs and is based on the concept
of sensitivity in linearized neural networks. The formalized criteria and algorithm for
pruning data vectors are formulated and illustrated with examples.

Introduction 

Neural networks are often used to model complex functional relationships between sets of
experimental data. This is particularly useful when an analytical model of a process either does not
exist or is not known, but sufficient data are available for embedding the relationships existing
between two or more data bases into a neural network model. Representative data can be used in such
a case to perform supervised training of a suitable neurocomputing architecture. Multilayer
feedforward neural networks (MFNN) have been found especially efficient for this purpose [1, 2]. The
minimization of redundancy in the training data is, however, an important issue that is rather rarely
addressed in the technical literature. The MFNN considered here are trained using the popular error
backpropagation technique in order to perform feedforward process identification [3].

Let us consider an MFNN with a single hidden layer. The network performs a nonlinear and
constrained mapping $o = \Gamma(x)$, where $o$ ($K \times 1$) and $x$ ($I \times 1$) are the output
and input vectors, respectively. It is assumed that certain inputs bear no, or little, statistical or
deterministic relationship to the outputs, and that the input vectors could therefore be compressed.
The objective of this study is to reduce the dimensionality of the input vector $x$, and thus to prune
the input data set, so that a smaller network can be used as a model of the relationship between the
data. Initial findings on this subject have been published in [4-6]. This paper introduces a more
general and formal approach to reducing the input size of the network. The sensitivity approach can
also be used to delete weights which are unimportant for neural network performance, as has been
proposed in [7].


Sensitivities  to  Inputs 

Let us define the sensitivity of a trained MFNN output, $o_k$, with respect to its input $x_i$ as

$$S^{o_k}_{x_i} \triangleq \frac{\partial o_k}{\partial x_i} \qquad (1a)$$

which can be written succinctly as

$$S_{ki} \triangleq \frac{\partial o_k}{\partial x_i} \qquad (1b)$$

By using the standard notation of the error backpropagation approach [3], the derivative in (1a) can
be readily expressed in terms of the network weights as follows

$$\frac{\partial o_k}{\partial x_i} = o_k' \sum_{j=1}^{J} w_{kj} \frac{\partial y_j}{\partial x_i} \qquad (2)$$

where $y_j$ denotes the output of the j-th neuron of the hidden layer, and $o_k'$ is the value of the
derivative of the activation function $o = f(net)$ at the k-th output neuron. This further yields

$$\frac{\partial o_k}{\partial x_i} = o_k' \sum_{j=1}^{J} w_{kj}\, y_j'\, v_{ji} \qquad (3)$$

where $y_j'$ is the value of the derivative of the activation function $y = f(net)$ of the j-th hidden
neuron ($y_J' = 0$, since the J-th neuron is a dummy one, i.e. it serves as a bias input to the output
layer). The sensitivity matrix $S$ ($K \times I$) consisting of entries as in (3) or (1b) can now be
expressed using array notation as

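$$S = O' \times W \times Y' \times V \qquad (4)$$

where $W$ ($K \times J$) and $V$ ($J \times I$) are the output and hidden layer weight matrices,
respectively, and $O'$ ($K \times K$) and $Y'$ ($J \times J$) are diagonal matrices defined as follows

$$O' \triangleq \mathrm{diag}(o_1', o_2', \ldots, o_K'), \qquad Y' \triangleq \mathrm{diag}(y_1', y_2', \ldots, y_J') \qquad (5)$$
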
Matrix $S$ contains entries $S_{ki}$ which are ratios of the absolute increments of output k due to
input i, as defined in (1b). This matrix depends only upon the network weights and the slopes of the
activation functions of all neurons. Each training vector $x^{(n)} \in \mathcal{T}$, where
$\mathcal{T} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ denotes the training set, produces a different
sensitivity matrix $S^{(n)}$, even for a fixed network. This is due to the fact that, although the
weights of a trained network remain constant, the activation values of the neurons change across the
set of training vectors $x^{(n)}$, $n = 1, 2, \ldots, N$. This, in turn, produces different diagonal
matrices of derivatives $O'$ and $Y'$, which strongly depend upon the neurons' operating points
determined by their activation values.
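
The matrix product in (4) can be illustrated with a small sketch for a single training vector. Several choices here are assumptions that the paper does not fix: tanh is used as the activation in both layers, the bias input is the last component and equals 1, and the dummy hidden neuron is represented by appending a constant 1 to the hidden outputs rather than by a row of V. The function names are hypothetical.

```python
import math

def forward(x, V, W):
    """x: real inputs; V: hidden weights (one row per real hidden neuron,
    last column is the bias weight); W: output weights (last column is bias)."""
    xa = list(x) + [1.0]                      # augmented input, bias last
    y = [math.tanh(sum(v * xi for v, xi in zip(row, xa))) for row in V]
    ya = y + [1.0]                            # dummy neuron feeds the output bias
    o = [math.tanh(sum(w * yj for w, yj in zip(row, ya))) for row in W]
    return y, o

def sensitivity(x, V, W):
    """S[k][i] = o_k' * sum_j w_kj * y_j' * v_ji for the real inputs i.
    The dummy neuron has zero derivative, so it drops out of the sum."""
    y, o = forward(x, V, W)
    dy = [1.0 - yj * yj for yj in y]          # derivative of tanh at hidden neurons
    do = [1.0 - ok * ok for ok in o]          # derivative of tanh at output neurons
    return [[do[k] * sum(W[k][j] * dy[j] * V[j][i] for j in range(len(y)))
             for i in range(len(x))]
            for k in range(len(o))]
```

Evaluating `sensitivity` at two different input vectors of the same network gives two different matrices, since `dy` and `do` depend on the operating point.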


Measures  of  Sensitivity  over  a  Training  Set 

In order to possibly reduce the dimensionality of the input vectors, the sensitivity matrix as in
(4) needs to be evaluated over the entire training set $\mathcal{T}$. Let us define the sensitivity
matrix for the pattern $x^{(n)}$ as $S^{(n)}$. There are several ways to define the overall
sensitivity matrix, each relating to a different objective function which needs to be minimized.

The mean square average sensitivities, $S_{ki,avg}$, over the training set $\mathcal{T}$ can be
computed as

$$S_{ki,avg} = \sqrt{\frac{\sum_{n=1}^{N} \left(S_{ki}^{(n)}\right)^2}{N}} \qquad (6)$$

Matrix $S_{avg}$ ($K \times I$) is defined as $[S_{avg}] = S_{ki,avg}$. This method of sensitivity
averaging is coherent with the goal of network training, which minimizes the mean square error over
all outputs and all patterns in the set.

The absolute value average sensitivities, $S_{ki,abs}$, over the training set $\mathcal{T}$ can be
computed as

$$S_{ki,abs} = \frac{\sum_{n=1}^{N} \left|S_{ki}^{(n)}\right|}{N} \qquad (7)$$

Matrix $S_{abs}$ ($K \times I$) is defined as $[S_{abs}] = S_{ki,abs}$. Note that summing
sensitivities across the training set requires taking their absolute values, because negative and
positive values could otherwise cancel. This method of averaging may be better than (6) if the
sensitivities $S_{ki}^{(n)}$, $n = 1, \ldots, N$, are of disparate values.

The maximum sensitivities, $S_{ki,max}$, over the training set $\mathcal{T}$ can be computed as

$$S_{ki,max} = \max_{n=1,\ldots,N} \left\{ S_{ki}^{(n)} \right\} \qquad (8)$$

Matrix $S_{max}$ ($K \times I$) is defined as $[S_{max}] = S_{ki,max}$. This sensitivity definition
prevents the pruning of inputs which are relevant for the network in only a small percentage of the
input vectors in the whole training set. However, it can happen that a few fuzzy patterns in a large
set affect entries of the sensitivity array by associating fuzziness with additional inputs. Such
fuzzy results are masked by the averaging in (6)-(7), but not by (8). Therefore the significance of
inputs can be overestimated, and some unimportant inputs may remain after the dimension is reduced.
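
A short sketch of the three overall measures follows. The forms used are assumptions based on the names of the measures: root of the mean of squares for the mean square average, mean of magnitudes for the absolute value average, and the largest signed entry over the patterns for the maximum. The function name is hypothetical.

```python
import math

def overall_sensitivities(per_pattern):
    """per_pattern: list of N sensitivity matrices (K rows, I columns each).
    Returns the matrices (S_avg, S_abs, S_max)."""
    n = len(per_pattern)
    rows, cols = len(per_pattern[0]), len(per_pattern[0][0])
    s_avg = [[0.0] * cols for _ in range(rows)]
    s_abs = [[0.0] * cols for _ in range(rows)]
    s_max = [[0.0] * cols for _ in range(rows)]
    for k in range(rows):
        for i in range(cols):
            values = [S[k][i] for S in per_pattern]
            s_avg[k][i] = math.sqrt(sum(v * v for v in values) / n)  # mean square average
            s_abs[k][i] = sum(abs(v) for v in values) / n            # absolute value average
            s_max[k][i] = max(values)                                # signed maximum
    return s_avg, s_abs, s_max
```

With entries 3 and -4 for the same (k, i) over two patterns, a plain arithmetic mean would give -0.5, whereas the first two measures give about 3.54 and 3.5; this is the cancellation that the absolute values are meant to avoid.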

Any of the sensitivity measure matrices proposed in (6)-(8) can provide useful information
as to the relative significance of each of the inputs in the training set $\mathcal{T}$ to each of the
outputs. For the sake of simplicity, however, only the matrix defined in (6) will be used in further
discussion. The cumulative statistical information resulting from (6) will be used along with criteria
for reducing the number of inputs to the smallest number sufficient for accurate learning. These
criteria are formulated in the next section.

Criteria  for  Pruning  Inputs 

Inspection of the average sensitivity matrix $S_{avg}$ makes it possible to determine which inputs
affect the outputs least. A small value of $S_{ki,avg}$ in comparison to the others means that the
i-th input does not significantly contribute to the k-th output of the network, and may therefore
possibly be disregarded. This reasoning and the results of experiments lead to the following
practical rule: The sensitivity matrices for a trained neural network can be evaluated for both the
training and testing data sets; the values of the average sensitivity matrix entries can be used for
determining the least significant inputs and for reducing the size of the network accordingly by
pruning unnecessary inputs.

When one or more of the inputs have relatively small sensitivity in comparison to the others,
the dimension of the neural network can be reduced by dropping them, and the smaller neural network
can be successfully retrained in most cases. The criterion used in this paper for determining which
inputs can be pruned is based on the so-called largest gap method.

In order to normalize the data relevant for the comparison of the significance of inputs, the
sensitivity matrix defined in (6)-(8) has to be additionally preprocessed. The formulas often used for
scaling are given in (9) and map each input into the range [0; 1] and each output into the range
[-1; 1]:


If the input and output data scaling (9) was performed before network training, no additional
operations on $S_{ki}$ are required and we have


(10) 


Note that the scaling can be performed either on the entries of $S$ or of $S_{avg}$. Experiments
were also performed with inputs scaled into the range [-1; 1]. Similar results were achieved for the
same learning conditions. The latter scaling seems to speed up the learning convergence, while the
accuracy and the relations among sensitivities remain unchanged.
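
A minimal sketch of range scaling is shown below. Linear min-max scaling is assumed, as is the handling of a constant column; the function name is hypothetical.

```python
def scale_column(values, lo, hi):
    """Map values linearly so that min(values) -> lo and max(values) -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no information; place it mid-range.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]

inputs_scaled = scale_column([2.0, 4.0, 6.0], 0.0, 1.0)     # inputs into [0; 1]
outputs_scaled = scale_column([2.0, 4.0, 6.0], -1.0, 1.0)   # outputs into [-1; 1]
```

Scaling all inputs and all outputs to common ranges before training puts the sensitivities of different inputs on a comparable footing.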

In the case when the original network inputs and outputs are not scaled to the same level, the
additional scaling (11) is necessary to allow for accurate comparison among inputs.


(11)


The significance of the i-th input, $\Phi_i$, across the entire training set $\mathcal{T}$ is defined as:

$\Phi_{abs}$ and $\Phi_{max}$ can be evaluated similarly to $\Phi_{avg}$ defined in (12). In order to
distinguish inputs of high and low importance, the entries of $\Phi_i$ have to be sorted in descending
order so that:

$$\Phi_{i_m} \ge \Phi_{i_{m+1}}, \qquad m = 1, \ldots, I-1 \qquad (13)$$

where $i_m$ is the sequence of sorted input numbers. Let us define the measure of gap as

(14)


and then find the largest gap using formula (15).

$$g_{max} = \max_{i_m} \{ g_{i_m} \} \quad \text{and} \quad m_{cur} = m \ \text{such that} \ g_{i_m} = g_{max} \qquad (15)$$

If condition (16) is valid, then the gap found between $m_{cur}$ and $m_{cur}+1$ is large enough.

$$C\, g_{max} > \max_{i_m \ne i_{m_{cur}}} \{ g_{i_m} \} \qquad (16)$$

The constant C in (16) is chosen arbitrarily within a reasonable range (e.g. C=0.5; the smaller C is,
the stronger the condition for the existence of an acceptable gap). All inputs with indices
$\{i_{m_{cur}+1}, \ldots, i_{I-1}\}$ can be pruned with the smallest loss of information to the MFNN.
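
The largest gap test can be sketched as follows. The gap is taken here as the ratio of consecutive sorted significance values, which is an assumption about the gap measure; the function name, the zero-based input indices, and the treatment of degenerate cases are likewise illustrative only.

```python
def largest_gap_prune(significance, C=0.5):
    """Return the (zero-based) indices of inputs that may be pruned, or []
    if the largest gap does not dominate the remaining gaps."""
    order = sorted(range(len(significance)),
                   key=lambda i: significance[i], reverse=True)
    gaps = []
    for m in range(len(order) - 1):
        upper, lower = significance[order[m]], significance[order[m + 1]]
        gaps.append(upper / lower if lower > 0 else float("inf"))
    if len(gaps) < 2:
        return []                      # nothing to compare the largest gap with
    m_cur = max(range(len(gaps)), key=gaps.__getitem__)
    others = [g for m, g in enumerate(gaps) if m != m_cur]
    if C * gaps[m_cur] <= max(others):
        return []                      # the largest gap is not large enough
    return sorted(order[m_cur + 1:])   # every input ranked below the gap

# Hypothetical significance values: inputs 2 and 4 are far below the rest.
prunable = largest_gap_prune([0.9, 0.8, 0.05, 0.7, 0.04], C=0.5)
```

In this example the sorted gaps are about 1.13, 1.14, 14 and 1.25, so with C=0.5 the test compares 7 against 1.25 and the two low-ranked inputs are returned.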


The gap method can also be applied to compare the sensitivities of the inputs to each
output separately. For this purpose, a set containing candidates for pruning can be created for every
output. Final pruning is performed by removing those inputs which appear in every set determined
for each output independently.
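
Combining the per-output candidate sets amounts to a set intersection, as in the sketch below. The function name is hypothetical; the example values mirror the first numerical experiment, where the candidates are inputs 3 and 4 for the first output and inputs 1 and 4 for the second.

```python
def common_prune_candidates(candidates_per_output):
    """Inputs that appear in the candidate set of every output."""
    sets = [set(c) for c in candidates_per_output]
    return sorted(set.intersection(*sets)) if sets else []

# Input numbers are 1-based here to match the text.
to_prune = common_prune_candidates([{3, 4}, {1, 4}])
```

An input that matters to even one output is kept, because it is then absent from that output's candidate set.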

Certainly, $S_{avg}$ can be evaluated meaningfully only for well-trained neural networks. Despite
this disadvantage, the proposed criteria can still save computational effort when initial learning can
be performed on a smaller, but still representative, subset of the data. $S_{avg}$ can be evaluated
based either on the data set used for initial training or on the complete data set. Subsequently, a
newly developed neural network with the appropriate inputs can be retrained using the full set of
training patterns with reduced dimension.

Numerical  Examples 

A series of numerical simulations was performed in order to verify the proposed definitions
and the pruning criteria. In the first experiment a training set for a neural network was generated
using four inputs $x_1, \ldots, x_4$ and two outputs $o_1$ and $o_2$. The values of the outputs were
correlated with $x_1$ and $x_2$ for $o_1$, and with $x_2$ and $x_3$ for $o_2$. Input vectors $x$
($4 \times 1$) were produced using a random number generator. The expected values of vector $d$
($2 \times 1$) for the output vector $o$ ($2 \times 1$) were evaluated for each $x$ using a known
relationship $d = F(x)$, where $d$ is the desired (target) output vector for supervised training. The
training set $\mathcal{T}$ consisted of N=81 patterns. A neural network with 4 inputs, 2 outputs
and 6 hidden neurons (I=5, J=7, K=2) has been trained for the mean square error defined as in (17)


equal to 0.001 per input vector. Matrices of sensitivities were subsequently evaluated and $S_{avg}$
was produced at the end of training over the entire input data set $\mathcal{T}$.

The changes of the sensitivity entries during learning are presented in Fig. 1. It can be seen that
the untrained neural network in the example has, on average, smaller sensitivities than after
training. During training some of the average sensitivities $S_{ki,avg}$ increase, while others
converge towards low values. The final values of the sensitivities of the first output suggest
deleting $x_3$ and $x_4$, and those for the second output indicate that $x_1$ and $x_4$ could be
deleted. The only input which shows up in both sets of candidates for deletion is $x_4$. Therefore,
the fourth input to the network can be skipped and the input dimension reduced to 3 (I=4).

After deleting $x_4$ from the learning data set, the new network with 3 inputs was trained
successfully to the same accuracy. The learning profiles for the full and reduced input sets under the
same learning conditions are compared in Fig. 2. Not only does the network with 3 inputs train within
a smaller number of cycles, but each learning cycle is also performed more quickly due to the reduced
input layer size.

If an input not recommended for pruning is erroneously deleted, the network is found unable
to learn the data sets. The mean square error per pattern remained at a level of approximately
0.25, as shown in Fig. 2. The entries of the sensitivity matrix remain at a low level, as shown in
Fig. 3. There may still be some gap between the entries, but it cannot be used for pruning
because the MFNN has not learned the vectors correctly and, after input dimension reduction, would
not be able to learn more accurately. The gap which can be seen in Fig. 3 means that, for the
insufficient accuracy achieved during training, only one input could be left in the network
without significant deterioration of performance.

The second experiment was performed using a larger network and fuzzy data. The MFNN had 20
inputs (I=21), 10 hidden neurons (J=26) and 4 outputs (K=4). There were N=500 patterns in the
training set and several additional data sets of the same size for network performance evaluation.
The network was successfully trained to an MSE of 0.15. However, due to the fuzziness of
the data, the MSE for the additional sets remained at the level of 0.20.

All outputs were strongly correlated with inputs $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_8$, and
$x_9$. Input $x_8$ was multiplied by random numbers during data generation, while the influence of
$x_2$ and $x_4$ on the outputs was scaled down to remain small in comparison to the other inputs
(less than 0.05).

The input importances calculated using formulas (6)-(8) are shown in Fig. 4. Inputs $x_2$ and
$x_4$ are ranked, after sorting, as less important than some inputs which are not correlated with the
outputs at all. This occurred because of their low correlation to the outputs; they can be ignored,
like the other uncorrelated inputs, for the given MSE used as the final condition for training. The
sequence of significance is the same for all proposed methods; however, the sizes of the gaps differ
in each case. The value C=0.5 prevents pruning using the $\Phi_{max}$ definition. Note that the
maximum method does not give a clear clue where to set the level for purging, due to the fuzziness of
the training data.

The result of initial training is shown in Fig. 5. It can be determined from this figure which
inputs should remain after pruning. The network performance after pruning is shown in Fig. 6. No
additional dimension reduction is possible because no large gap in the input importances can be found.
The speed of training has increased mostly because of the reduction of the MFNN size (input dimension
reduced by 4). The necessary number of training cycles has also decreased, but not as dramatically
as in the first experiment.

Conclusions 

Using the sensitivity approach for input layer pruning seems particularly useful when network
training involves a large amount of redundant data. In the first phase, the network can be pre-trained
until the training error decreases satisfactorily. Then the sensitivity matrices can be evaluated and
the dimension of the input layer possibly reduced. Learning can subsequently be resumed until the
training error reduces to an acceptably low value. This process can be repeated; however, usually
only the first execution yields significant improvement. Numerical experiments indicate that the
effort of additional network retraining can be too high in comparison to the benefits of further
minimization.

Should redundancy exist in the training data vectors, the proposed approach based on the
average sensitivity matrices for input data pruning allows for more efficient training. This can be
achieved at a relatively low computational cost, based on the heuristic data pruning criteria outlined
in the paper. The approach can be combined with other improved training strategies such as
increased complexity training [5]. Extension of the proposed sensitivity-based input pruning concept
beyond continuous output values seems desirable for the case of networks with binary outputs, such as
classifiers and other binary encoders.